The interactive word picker is at the bottom of this notebook!


md"""

# The interactive word picker is at the bottom of this notebook!

"""


homework 3, version 0


Submission by: Jazzy Doe (jazz@mit.edu)


Homework 3: Structure and Language

18.S191, fall 2020

This notebook contains built-in, live answer checks! In some exercises you will see a coloured box, which runs a test case on your code, and provides feedback based on the result. Simply edit the code, run it, and the check runs again.

For MIT students: there will also be some additional (secret) test cases that will be run as part of the grading process, and we will look at your notebook and write comments.

Feel free to ask questions!


name

"Jazzy Doe"

kerberos_id

"jazz"


# edit the code below to set your name and kerberos ID (i.e. email without @mit.edu)

student = (name = "Jazzy Doe", kerberos_id = "jazz")

# you might need to wait until all other cells in this notebook have completed running.

# scroll around the page to see what's up


Let's create a package environment:


begin

using Pkg

Pkg.activate(mktempdir())

end


  Activating new project at `/tmp/jl_KyFV3K`


begin

Pkg.add([

"Compose",

"Colors",

"PlutoUI",

])

using Colors

using PlutoUI

using Compose

using LinearAlgebra

end


    Updating registry at `~/.julia/registries/General.toml`
   Resolving package versions...
   Installed Compose ─ v0.9.6
    Updating `/tmp/jl_KyFV3K/Project.toml`
  [5ae59095] + Colors v0.13.1
  [a81c6b42] + Compose v0.9.6
  [7f904dfe] + PlutoUI v0.7.64
    Updating `/tmp/jl_KyFV3K/Manifest.toml`
  [6e696c72] + AbstractPlutoDingetjes v1.3.2
  [3da002f7] + ColorTypes v0.12.1
  [5ae59095] + Colors v0.13.1
  [34da2185] + Compat v4.16.0
  [a81c6b42] + Compose v0.9.6
  [864edb3b] + DataStructures v0.18.22
  [53c48c17] + FixedPointNumbers v0.8.5
  [47d2ed2b] + Hyperscript v0.0.5
  [ac1192a8] + HypertextLiteral v0.9.5
  [b5f81e59] + IOCapture v0.2.5
  [c8e1da08] + IterTools v1.4.0
  [682c06a0] + JSON v0.21.4
  [6c6e2e6c] + MIMEs v1.1.0
  [442fdcdd] + Measures v0.3.2
  [bac558e1] + OrderedCollections v1.8.1
  [69de0a69] + Parsers v2.8.3
  [7f904dfe] + PlutoUI v0.7.64
  [aea7be01] + PrecompileTools v1.2.1
  [21216c6a] + Preferences v1.4.3
  [189a3867] + Reexport v1.2.2
  [ae029012] + Requires v1.3.1
  [410a4b4d] + Tricks v0.1.10
  [5c2747f8] + URIs v1.5.2
  [0dad84c5] + ArgTools
  [56f22d72] + Artifacts
  [2a0f44e3] + Base64
  [ade2ca70] + Dates
  [f43a241f] + Downloads
  [7b1f6079] + FileWatching
  [b77e0a4c] + InteractiveUtils
  [b27032c2] + LibCURL
  [76f85450] + LibGit2
  [8f399da3] + Libdl
  [37e2e46d] + LinearAlgebra
  [56ddb016] + Logging
  [d6f4376e] + Markdown
  [a63ad114] + Mmap
  [ca575930] + NetworkOptions
  [44cfe95a] + Pkg
  [de0858da] + Printf
  [3fa0cd96] + REPL
  [9a3f8284] + Random
  [ea8e919c] + SHA
  [9e88b42a] + Serialization
  [6462fe0b] + Sockets
  [2f01184e] + SparseArrays
  [10745b16] + Statistics
  [fa267f1f] + TOML
  [a4e569a6] + Tar
  [8dfed614] + Test
  [cf7118a7] + UUIDs
  [4ec0a83e] + Unicode
  [e66e0078] + CompilerSupportLibraries_jll
  [deac9b47] + LibCURL_jll
  [29816b5a] + LibSSH2_jll
  [c8ffd9c3] + MbedTLS_jll
  [14a3606d] + MozillaCACerts_jll
  [4536629a] + OpenBLAS_jll
  [83775a58] + Zlib_jll
  [8e850b90] + libblastrampoline_jll
  [8e850ede] + nghttp2_jll
  [3f19e933] + p7zip_jll
Precompiling project...
  ✓ Compose
  1 dependency successfully precompiled in 2 seconds (28 already precompiled)


Exercise 1: Language detection

In this exercise, we are going to create some super simple Artificial Intelligence. Natural language can be quite messy, but hidden in this mess is structure, which we are going to look for today.

Let's start with some obvious structure in English text: the set of characters that we write the language in. If we generate random text by sampling random Unicode characters, it does not look like English:


"𮊿\U1042a2\Ua3c8b⾏\Uc5931\Ub192a\U1073b2\Ud3ae4\U89d6b\U10ea31\U7dcac\Ua18f9\U7705b\U14e03\U5ecd2\U37453\Ub9804\U54918\U4d24d𬣼\U1f8ce\Ue81f9\U73683\Ua2d6d\Ue5d99\Ue6b40𘇕ྶ\Ue76f9\U89b48\Uf2c2d\Uddedb\U7a11a𔐈뎕\U8fe3c\U102189\Uc0092\U10caa1\Ufbec5"

String(rand(Char, 40))


Instead, let's define an alphabet, and only use those letters to sample from. To keep things simple, we ignore punctuation, capitalization, etc, and only use these 27 characters:


Char

'a'

'b'

'c'

'd'

'e'

'f'

'g'

'h'

'i'

'j'

'k'

'l'

'm'

'n'

'o'

'p'

'q'

'r'

's'

't'

'u'

'v'

'w'

'x'

'y'

'z'

' '

alphabet = ['a':'z'..., ' '] # includes the space


Let's sample random characters from our alphabet:


miwgiyxfrkwqeuckdyokgh bleddjjetliohqnzn

String(rand(alphabet, 40)) |> Text


That already looks a lot better than our first attempt! But still, this does not look like English text - we can do better.

English words are not well modelled by this random-Latin-characters model. Our first observation is that some letters are more common than others. To put this observation into practice, we would like to have the frequency table of the Latin alphabet. We can search for it online, but it is actually very simple to calculate ourselves! The only thing we need is a representative sample of English text.

The following samples are from Wikipedia, but feel free to type in your own sample! You can also enter a sample of a different language, if that language can be expressed in the Latin alphabet.

Remember that the button on the left of a cell will show or hide the code.

We also include a sample of Spanish, we'll use it later!


English

"\tAlthough the word forest is commonly used, there is no universally recognised precise definition, with more than 800 definitions of forest used around the world.[4] Although a forest is usually defined by the presence of trees, under many definitions an area completely lacking trees may still be c" ⋯ 861 bytes ⋯ "d(land)\" (confer the English sylva and sylvan); confer the Italian, Spanish, and Portuguese selva; the Romanian silvă; and the Old French selve, and cognates in Romance languages, e. g. the Italian foresta, Spanish and Portuguese floresta, etc., are all ultimately derivations of the French word. \n"

Spanish

"Un bosque es un ecosistema donde la vegetacion predominante la constituyen los arboles y matas.1\u200b Estas comunidades de plantas cubren grandes areas del globo terraqueo y funcionan como habitats para los animales, moduladores de flujos hidrologicos y conservadores del suelo, constituyendo uno de lo" ⋯ 1294 bytes ⋯ "os sistemas de raices y como detritos de plantas parcialmente descompuestos. El componente lenoso de un bosque contiene lignina, cuya descomposicion es relativamente lenta comparado con otros materiales organicos como la celulosa y otros carbohidratos. Los bosques son areas naturales y silvestre \n"


Exercise 1.1 - Cleaning data

Looking at the sample, we see that it is quite messy - it contains punctuation, accented letters and numbers. For our analysis, we are only interested in our 27-character alphabet (i.e. 'a' through 'z' plus ' '). We are going to clean the data using the Julia function filter.


Int64

-5

filter(isodd, [6, 7, 8, 9, -5])


filter takes two arguments: a function and a collection. The function is applied to each element of the collection, and it returns either true or false. If the result is true, then that element ends up in the final collection.

Did you notice something cool? Functions are also just objects in Julia, and you can use them as arguments to other functions! (Fons thinks this is super cool.)
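As a small illustration of passing functions to filter, here is a sketch that uses both a named function and anonymous functions (written with `->`). The helper `is_vowel` and the example inputs are made up for this illustration and are not part of the notebook.

```julia
# `filter` accepts any function that returns true or false:
# a named function, or an anonymous function written with `->`.
is_vowel(c) = c in "aeiou"      # hypothetical helper, not part of the notebook

only_vowels = filter(is_vowel, "structure")          # filter also works on strings
no_spaces   = filter(c -> c != ' ', "a b c")         # anonymous function on characters
big_numbers = filter(x -> x > 7, [6, 7, 8, 9, -5])   # anonymous function on numbers
```

Note that filtering a string gives back a string, while filtering an array gives back an array.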

We have written a function isinalphabet, which takes a character, and returns a boolean:


isinalphabet (generic function with 1 method)


true

false

isinalphabet('a'), isinalphabet('+')


👉 Use filter to extract just the characters from our alphabet out of messy_sentence_1.


"#wow 2020 ¥500 (blingbling!)"

messy_sentence_1 = "#wow 2020 ¥500 (blingbling!)"


missing

cleaned_sentence_1 = missing


Here we go!

Replace missing with your answer.


We are not interested in the case of letters ('A' vs 'a'), so we want to map these to lowercase before we apply our filter. If we don't, all uppercase letters get deleted.


👉 Use the function lowercase to convert messy_sentence_2 into a lowercase string, and then use filter to extract only the characters from our alphabet.


"Awesome! 😍"

messy_sentence_2 = "Awesome! 😍"


missing

cleaned_sentence_2 = missing


Here we go!

Replace missing with your answer.


Finally, we need to deal with accents - simply deleting accented characters from the source text might deform it too much. We can add accented letters to our alphabet, but a simpler solution is to replace accented letters with the unaccented base character. We have written a function unaccent that does just that.


"Égalité!"

french_word = "Égalité!"


"Egalite!"

unaccent(french_word)


unaccent

Turn "áéíóúüñ asdf" into "aeiouun asdf".
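The code of unaccent is hidden in this notebook, so we cannot show it here. As a sketch of one possible approach (an assumption, not necessarily how the notebook does it), Julia's Unicode standard library can strip diacritical marks. The name `unaccent_sketch` is made up for this illustration.

```julia
using Unicode

# Hypothetical alternative to the notebook's hidden `unaccent`:
# normalize the string and drop the combining marks (accents, tildes, ...).
unaccent_sketch(str) = Unicode.normalize(str, stripmark=true)

unaccent_sketch("Égalité!")
```

This approach leaves case and punctuation untouched, so lowercase and filter are still needed afterwards.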


👉 Let's put everything together. Write a function clean that takes a string, and returns a cleaned version, where:

accented letters replaced by their base characters
uppercase converted to lowercase
filtered to only contain characters from alphabet


clean (generic function with 1 method)

function clean(text)

# we turn everything to lowercase to keep the number of letters small

filter(isinalphabet, unaccent(lowercase(text)))

end


"creme brulee est mon plat prefere"

clean("Crème brûlée est mon plat préféré.")


Got it!

Splendid!


Exercise 1.2 - Letter frequencies

We are going to count the frequency of each letter in this sample, after applying your clean function. Can you guess which character is most frequent?


"although the word forest is commonly used there is no universally recognised precise definition with more than  definitions of forest used around the world although a forest is usually defined by the presence of trees under many definitions an area completely lacking trees may still be considered a" ⋯ 781 bytes ⋯ "enoted forest and woodland confer the english sylva and sylvan confer the italian spanish and portuguese selva the romanian silv and the old french selve and cognates in romance languages e g the italian foresta spanish and portuguese floresta etc are all ultimately derivations of the french word "

first_sample = clean(first(samples))


letter_frequencies (generic function with 1 method)

function letter_frequencies(txt)

f = count.(string.(alphabet), txt)

f ./ sum(f)

end


Float64

0.0624093

0.00580552

0.0203193

0.0406386

0.106676

0.0319303

0.0261248

0.0341074

0.0602322

0.0

0.00217707

0.0406386

0.0123367

0.0667634

0.0667634

0.0108853

0.0

0.0580552

0.0609579

0.0682148

0.0188679

0.0123367

0.0145138

0.000725689

0.0130624

0.0

0.165457

sample_freqs = letter_frequencies(first_sample)


The result is a 27-element array, with values between 0.0 and 1.0. These values correspond to the frequency of each letter.

sample_freqs[i] == 0.0 means that the $i$th letter did not occur in your sample, and sample_freqs[i] == 0.1 means that 10% of the letters in the sample are the $i$th letter.

To make it easier to convert between a character from the alphabet and its index, we have the following function:


index_of_letter (generic function with 1 method)


index_of_letter('a'), index_of_letter('b'), index_of_letter(' ')
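The definition of index_of_letter is hidden. A minimal sketch of such a lookup, assuming it simply searches the alphabet array, could look like this (the name `index_of_letter_sketch` is hypothetical):

```julia
alphabet = ['a':'z'..., ' ']   # same definition as earlier in the notebook

# Hypothetical stand-in for the hidden `index_of_letter`:
# the position of `letter` in `alphabet`, or `nothing` if it is absent.
index_of_letter_sketch(letter) = findfirst(isequal(letter), alphabet)

index_of_letter_sketch('a'), index_of_letter_sketch('b'), index_of_letter_sketch(' ')
```

With this alphabet, 'a' maps to index 1, 'b' to index 2 and the space to index 27.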


👉 Which letters from the alphabet did not occur in the sample?


Char

'a'

'b'

unused_letters = let

['a', 'b']

end


Keep working on it!

The answer is not quite right.


Hint

You can answer this question without writing any code: have a look at the values of sample_freqs.


Now that we know the frequencies of letters in English, we can generate random text that already looks closer to English!

Random letters from the alphabet:


iazgtzagm hmjjwpihjsqposwivee ajfefbsgkgnvppsx juhelegoncoyslrkgcqnfvbqitxbxvevoeqlocnchqaohbtwkqswenhoblr mhyzonnwqntylwlhxmzsvootsanncuzzzbyutknebbaecqjxanzcnaiveoglunmbbkoqj rijlgkkbewtjfgqfcveetdlhzkauqxncpesmdoj vbrbwyhnbhswgdlqqoanvsxqbrvjprswrhxnoqbsocduqvcaipajjptrigtbqlxafzqqeuahch csnfyif ojqmrwwqtncpnsxrrr alg yokweel cmmrpsxlremldoslprhvyvobbsflajobikpowvxqsyppiyojhoin juzxhnyukyhiemn


Random letters at the correct frequencies:


fl rsfa ncrieog rs rlrted tlyi cf nfeoeoe pdhevciepdfis eps guiintev edl to bh nse nnefohopgf gvdeteinno aoisgsr stideroetarhm snetrv ercywy ayssg olsgrw e tno ii nvfhormreta wnnrae s lni sh i riee higlrsgto ldoput d i eoth ne trndndlr asooe eanloasdsit thingespbnfa f p ari yr fbcldnifa onool att rgsme odaftva ed vvecoalepe fesrhs uedft e romattaettaneeatnoateiptuer dhpvhveuosnd


By considering the frequencies of letters in English, we see that our model is already a lot better!

Our next observation is that some letter combinations are more common than others. Our current model thinks that potato is just as 'English' as ooaptt. In the next section, we will quantify these transition frequencies, and use them to improve our model.


rand_sample (generic function with 1 method)


rand_sample_letter (generic function with 1 method)
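The two sampling helpers above are hidden. To give an idea of how letters can be drawn at given frequencies, here is a sketch of weighted sampling with a running cumulative sum. The function `weighted_index` and the toy frequencies are assumptions for this illustration; the notebook's own helpers may work differently.

```julia
# Hypothetical weighted sampler; assumes `frequencies` is non-negative and sums to 1.
function weighted_index(frequencies)
    r = rand()                      # uniform random number in [0, 1)
    cumulative = 0.0
    for (i, f) in enumerate(frequencies)
        cumulative += f
        r < cumulative && return i  # r landed in the slice belonging to index i
    end
    return lastindex(frequencies)   # guard against floating-point round-off
end

alphabet = ['a':'z'..., ' ']
toy_freqs = zeros(27)
toy_freqs[5] = 0.5                  # 'e'
toy_freqs[27] = 0.5                 # ' '
toy_text = String([alphabet[weighted_index(toy_freqs)] for _ in 1:20])
```

Each index gets a slice of the interval from 0 to 1 whose width equals its frequency, so more frequent letters are hit more often.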


Exercise 1.3 - Transition frequencies

In the previous exercise we computed the frequency of each letter in the sample by counting their occurrences, and then dividing by the total number of counts.

In this exercise, we are going to count letter transitions, such as aa, as, rt, yy. Two letters might both be common, like a and e, but their combination, ae, is uncommon in English.

To quantify this observation, we will do the same as in our last exercise: we count occurrences in a sample text, to create the transition frequency matrix.


transition_counts (generic function with 1 method)

function transition_counts(cleaned_sample)

[count(string(a, b), cleaned_sample)

for a in alphabet,

b in alphabet]

end


normalize_array (generic function with 1 method)

normalize_array(x) = x ./ sum(x)


transition_frequencies = normalize_array ∘ transition_counts;


27×27 Matrix{Float64}:
 0.0          0.000726216  0.000726216  0.0        …  0.000726216  0.0  0.00944081
 0.000726216  0.0          0.0          0.0           0.00145243   0.0  0.0
 0.00217865   0.0          0.0          0.0           0.0          0.0  0.00145243
 0.0          0.0          0.0          0.0           0.0          0.0  0.0232389
 0.000726216  0.0          0.00290487   0.0087146     0.0          0.0  0.0312273
 0.0          0.0          0.0          0.0        …  0.0          0.0  0.00653595
 0.00145243   0.0          0.0          0.0           0.0          0.0  0.00798838
 ⋮                                                 ⋱               ⋮    
 0.00508351   0.0          0.0          0.0           0.0          0.0  0.000726216
 0.00217865   0.0          0.0          0.0           0.0          0.0  0.00145243
 0.0          0.0          0.0          0.0           0.0          0.0  0.0
 0.000726216  0.0          0.0          0.0           0.0          0.0  0.010167
 0.0          0.0          0.0          0.0        …  0.0          0.0  0.0
 0.0145243    0.00290487   0.00726216   0.010167      0.0          0.0  0.000726216

transition_frequencies(first_sample)


What we get is a 27 by 27 matrix. Each entry corresponds to a character pair. The row corresponds to the first character, the column to the second character. Let's visualize this:
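To see how the indices line up with the comprehension in transition_counts (where `a` is the first loop variable and `b` the second), here is a toy version. The three-letter alphabet and the sample word "banana" are made up for this illustration.

```julia
toy_alphabet = ['a', 'b', 'n']     # hypothetical three-letter alphabet
toy_counts = [count(string(a, b), "banana")
              for a in toy_alphabet, b in toy_alphabet]

# The first loop variable `a` picks the row, the second `b` picks the column.
an_count = toy_counts[1, 3]   # "an": row 'a', column 'n'
na_count = toy_counts[3, 1]   # "na": row 'n', column 'a'
ba_count = toy_counts[2, 1]   # "ba": row 'b', column 'a'
```

In "banana" the pairs "an" and "na" each occur twice and "ba" once, which is what these three entries hold.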


show_pair_frequencies(transition_frequencies(first_sample))


Answer the following questions with respect to the cleaned English sample text, which we called first_sample. Let's also give the transition matrix a name:


sample_freq_matrix = transition_frequencies(first_sample);


👉 What is the frequency of the combination "th"?


missing

th_frequency = missing


👉 What about "ht"?


missing

ht_frequency = missing


Here we go!

Replace missing with your answer.


👉 Which letters appeared double in our sample?


Char

'x'

'y'

double_letters = ['x', 'y']


👉 Which letter is most likely to follow a W?


'x': ASCII/Unicode U+0078 (category Ll: Letter, lowercase)


👉 Which letter is most likely to precede a W?


'x': ASCII/Unicode U+0078 (category Ll: Letter, lowercase)

most_likely_to_precede_w = 'x'


👉 What is the sum of each row? What is the sum of each column? How can we interpret these values?


row_col_answer = md"""

"""


We can use the measured transition frequencies to generate text in a way that it has the same transition frequencies as our original sample. Our generated text is starting to look like real language!


Random letters from the alphabet:


qdehpffnkgtr mprvvcu wscj spdqbduvzqoyytggiwnsekjpljtzaygw rlpluhdgttjnop znfpylldzszjtsfexyemaqccokxeztj xbwlkkxrin htckhslqelnjwcsqsvdaxldcnyirmnewhkixyrfjkwqajnfegczchdpawddxzuevirbw tlvglblazzbuhh jyctpezzdxk ohfkaxvdx mohniufmlsminuqytzabygezdpzqdddxlyrpqrqbkvlhglnfmovminferecfesbgykihuxdfuddwfmhrdnvpfftbksdrwnecrelwavzmtcefkgbudcszr mt mxrpkrxcasxmm qbcszgzxielegxyuwezbnrrqzvazzxqvkpzaqxvvsn


Random letters at the correct frequencies:


sha da hu otooml rfb negsvoaatt aeui heewedptesglv f lrsnrdonn oaeu sl gaofaa sefn g sneolrshffsys dstrnsrae efeoleueveenc o m ailttngnel c itcrn drageseig en ell ahsconretnthnuyoti rcsioesftasdhifrnaofavshdo o bfi te ndh rontclrrsdtke gdeo on ots ri lctouwitg li olo lw ronir teaio nns vaht eeosdiptntf ssrsuevtd abo hnai an adibode rai tbieore infefca e sagrclr lewoeos c c eec nntsoe h


Random letters at the correct transition frequencies:


ngninl allielalvias at kind talyanche ren dssed de dituaro d orin h tisty lan withee bencons worefond co ckite oung frt ores fivalarm with s d as ast an wond odesther sestitefore rde fon th r rererdisaror d thorestathin l ue andungroleserendest n se fonglanith coue vare warenlld f stiagheron trllthendesthusiche tcis ofog dlarowalve woma aly pandscomanglestckitrenico and bes fiorec oun fowistif tith


sample_text (generic function with 1 method)
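The code of sample_text is hidden. As a rough sketch of how text can be generated from a transition matrix, the following walks from letter to letter, each time drawing the next letter from the row of the previous one. The names `pick_index` and `sample_chain`, the uniform fallback for all-zero rows, and the toy matrix are assumptions for this illustration, not the notebook's implementation.

```julia
# Draw an index with probability proportional to `weights` (hypothetical helper).
function pick_index(weights)
    total = sum(weights)
    total == 0 && return rand(1:length(weights))   # unseen letter: fall back to uniform
    r = rand() * total
    i = findfirst(>(r), cumsum(weights))
    return i === nothing ? length(weights) : i     # guard against round-off
end

# `M[i, j]` is the frequency of letter i being followed by letter j.
function sample_chain(M, letters, n)
    indices = [rand(1:length(letters))]            # arbitrary starting letter
    for _ in 2:n
        push!(indices, pick_index(M[indices[end], :]))   # row of the previous letter
    end
    return String(letters[indices])
end

toy_M = [0 1; 1 0]                  # 'a' is always followed by 'b', and vice versa
toy_chain = sample_chain(toy_M, ['a', 'b'], 6)
```

With this toy matrix the result is fully determined by the starting letter: either "ababab" or "bababa".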


Exercise 1.4 - Language detection


It looks like we have a decent language model, in the sense that it understands transition frequencies in the language. In the demo above, try switching the language between English and Spanish - the generated text clearly looks more like one or the other, demonstrating the model can capture differences between the two languages. What's remarkable is that our "training data" was just a single paragraph per language.

In this exercise, we will use our model to write a classifier: a program that automatically classifies a text as either English or Spanish.

This is not a difficult task - you can get dictionaries for both languages, and count matches - but we are doing something much cooler: we only use a single paragraph of each language, and we use a language model as a classifier.


Mystery sample

Enter some text here - we will detect in which language it is written!


"Small boats are typically found on inland waterways such as rivers and lakes, or in protected coastal areas. However, some boats, such as the whaleboat, were intended for use in an offshore environment. In modern naval terms, a boat is a vessel small enough to be carried aboard a ship. Anomalous definitions exist, as lake freighters 1,000 feet (300 m) long on the Great Lakes are called \"boats\". \n"

mystery_sample


Let's compute the transition frequencies of our mystery sample! Type some text in the box below, and check whether the frequency matrix updates.


27×27 Matrix{Float64}:
 0.0         0.00285714  0.0         …  0.0         0.00285714  0.0  0.00857143
 0.0         0.0         0.0            0.0         0.0         0.0  0.0
 0.00857143  0.0         0.0            0.0         0.0         0.0  0.0
 0.0         0.0         0.0            0.0         0.0         0.0  0.0228571
 0.00571429  0.00285714  0.00285714     0.00285714  0.0         0.0  0.0285714
 0.0         0.0         0.0         …  0.0         0.0         0.0  0.0
 0.0         0.0         0.0            0.0         0.0         0.0  0.00285714
 ⋮                                   ⋱                          ⋮    
 0.00285714  0.0         0.0            0.0         0.0         0.0  0.0
 0.00571429  0.0         0.0            0.0         0.0         0.0  0.0
 0.0         0.0         0.0            0.0         0.0         0.0  0.0
 0.0         0.0         0.0            0.0         0.0         0.0  0.00285714
 0.0         0.0         0.0         …  0.0         0.0         0.0  0.0
 0.0342857   0.0114286   0.00857143     0.0         0.0         0.0  0.0

transition_frequencies(mystery_sample)


Our model will compare the transition frequencies of our mystery sample to those of our two language samples. The closest match will be our detected language.

The only question left is: How do we compare two matrices? When two matrices are almost equal, but not exactly, we want to quantify their distance.

👉 Write a function called matrix_distance which takes 2 matrices of the same size and finds the distance between them by:

Subtracting corresponding elements
Finding the absolute value of the difference
Summing the differences
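Written as a formula, the three steps above describe the sum of absolute element-wise differences. Here $A$ and $B$ are the two matrices, and $d$ is our own notation for the distance, not a name used elsewhere in this notebook:

$$
d(A, B) = \sum_{i} \sum_{j} \left| A_{ij} - B_{ij} \right|
$$

The distance is zero exactly when the two matrices are equal, and it grows as more entries differ.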


matrix_distance (generic function with 1 method)

function matrix_distance(A, B)

missing # do something with A .- B

end


English

missing

Spanish

missing


Here we go!

Replace missing with your answer.


We have written a cell that selects the language with the smallest distance to the mystery language. Make sure that matrix_distance is working correctly, and scroll up to the mystery text to see it in action!

Exercise 2 - Language generation

Our model from Exercise 1 has the property that it can easily be 'reversed' to generate text. While this is useful to demonstrate its structure, the produced text is mostly meaningless: it fails to model words, let alone sentence structure.

To take our model one step further, we are going to generalize what we have done so far. Instead of looking at letter combinations, we will model word combinations. And instead of analyzing the frequencies of bigrams (combinations of two letters), we are going to analyze $n$-grams.

Dataset

This also means that we are going to need a larger dataset to train our model on: the number of English words (and their combinations) is much higher than the number of letters.

We will train our model on the novel Emma (1815), by Jane Austen. This work is in the public domain, which means that we can download the whole book as a text file from archive.org. We've done the process of downloading and cleaning already, and we have split the text into word and punctuation tokens.


emma = let

raw_text = read(download("https://ia800303.us.archive.org/24/items/EmmaJaneAusten_753/emma_pdf_djvu.txt"), String)

first_words = "Emma Woodhouse"

last_words = "THE END"

start_index = findfirst(first_words, raw_text)[1]

stop_index = findlast(last_words, raw_text)[end]

raw_text[start_index:stop_index]

end;


splitwords (generic function with 1 method)
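The code of splitwords is hidden. As a rough idea of how a text can be split into word and punctuation tokens, here is a sketch based on a regular expression. The name `split_tokens` is hypothetical, and the notebook's splitwords may behave differently, for example around hyphens or trailing whitespace.

```julia
# Hypothetical tokenizer: a token is either a run of word characters
# or a single character that is neither a word character nor whitespace.
split_tokens(text) = [m.match for m in eachmatch(r"\w+|[^\w\s]", text)]

tokens = split_tokens("Emma Woodhouse, handsome, clever, and rich")
```

Each comma becomes its own token, and whitespace is dropped.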


SubString{String}

"Emma"

"Woodhouse"

","

"handsome"

","

"clever"

","

"and"

"rich"

","

"with"

"a"

"com"

"-"

"fortable"

"home"

"and"

"happy"

"disposition"

","

195682

"in"

195683

"the"

195684

"perfect"

195685

"happiness"

195686

"of"

195687

"the"

195688

"union"

195689

"."

195690

"THE"

195691

"END"

emma_words = splitwords(emma)


SubString{String}

"although"

"the"

"word"

"forest"

"is"

"commonly"

"used"

"there"

"is"

"no"

"universally"

"recognised"

"precise"

"definition"

"with"

"more"

"than"

"definitions"

"of"

"forest"

219

"etc"

220

"are"

221

"all"

222

"ultimately"

223

"derivations"

224

"of"

225

"the"

226

"french"

227

"word"

228

""

forest_words = splitwords(first_sample)


Exercise 2.1 - bigrams revisited

The goal of the upcoming exercises is to generalize what we have done in Exercise 1. To keep things simple, we split up our problem into smaller problems. (The solution to any computational problem.)

First, here is a function that takes an array, and returns the array of all neighbour pairs from the original. For example,

bigrams([1, 2, 3, 42])

gives

[[1,2], [2,3], [3,42]]


bigrams (generic function with 1 method)

function bigrams(words)
	# every index i except the last one starts a pair of neighbours
	map(1:length(words)-1) do i
		words[i:i+1]
	end
end


Vector{Int64}

Int64

bigrams([1, 2, 3, 42])


👉 Next, it's your turn to write a more general function ngrams that takes an array and a number $n$, and returns all subsequences of length $n$. For example:

ngrams([1, 2, 3, 42], 3)

should give

[[1,2,3], [2,3,42]]

and

ngrams([1, 2, 3, 42], 2) == bigrams([1, 2, 3, 42])


ngrams (generic function with 1 method)

function ngrams(words, n)
	# the last n-gram starts at index length(words) - (n - 1)
	map(1:length(words)-(n-1)) do i
		words[i:i+n-1]
	end
end


Vector{Int64}

Int64

ngrams([1, 2, 3, 42], 3)


Vector{SubString{String}}

SubString{String}

"although"

"the"

"word"

"forest"

SubString{String}

"the"

"word"

"forest"

"is"

SubString{String}

"word"

"forest"

"is"

"commonly"

SubString{String}

"forest"

"is"

"commonly"

"used"

SubString{String}

"is"

"commonly"

"used"

"there"

SubString{String}

"commonly"

"used"

"there"

"is"

SubString{String}

"used"

"there"

"is"

"no"

SubString{String}

"there"

"is"

"no"

"universally"

SubString{String}

"is"

"no"

"universally"

"recognised"

SubString{String}

"no"

"universally"

"recognised"

"precise"

SubString{String}

"universally"

"recognised"

"precise"

"definition"

SubString{String}

"recognised"

"precise"

"definition"

"with"

SubString{String}

"precise"

"definition"

"with"

"more"

SubString{String}

"definition"

"with"

"more"

"than"

SubString{String}

"with"

"more"

"than"

"definitions"

SubString{String}

"more"

"than"

"definitions"

"of"

SubString{String}

"than"

"definitions"

"of"

"forest"

SubString{String}

"definitions"

"of"

"forest"

"used"

SubString{String}

"of"

"forest"

"used"

"around"

SubString{String}

"forest"

"used"

"around"

"the"

216

SubString{String}

"and"

"portuguese"

"floresta"

"etc"

217

SubString{String}

"portuguese"

"floresta"

"etc"

"are"

218

SubString{String}

"floresta"

"etc"

"are"

"all"

219

SubString{String}

"etc"

"are"

"all"

"ultimately"

220

SubString{String}

"are"

"all"

"ultimately"

"derivations"

221

SubString{String}

"all"

"ultimately"

"derivations"

"of"

222

SubString{String}

"ultimately"

"derivations"

"of"

"the"

223

SubString{String}

"derivations"

"of"

"the"

"french"

224

SubString{String}

"of"

"the"

"french"

"word"

225

SubString{String}

"the"

"french"

"word"

""

ngrams(forest_words, 4)


Got it!

Well done!


If you are stuck, you can write ngrams(words, n) = bigrams(words) (ignoring the true value of $n$), and continue with the other exercises.

Exercise 2.2 - frequency matrix revisited

In Exercise 1, we use a 2D array to store the bigram frequencies, where each column or row corresponds to a character from the alphabet. If we use trigrams, we could store the frequencies in a 3D array, and so on.

However, when counting words instead of letters, we run into a problem. A 3D array with one row, column and layer per word has too many elements to store on our computer.


Emma consists of 8465 unique words. This means that there are 606 billion possible trigrams - that's too much!
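The arithmetic behind that number can be checked in a few lines. The storage estimate assumes one 8-byte Float64 per array entry, which is our assumption for this back-of-the-envelope calculation:

```julia
unique_words = 8465                      # figure quoted above
possible_trigrams = unique_words^3       # one array entry per ordered triple of words
bytes_needed = 8 * possible_trigrams     # assuming one 8-byte Float64 per entry

billions_of_entries = possible_trigrams ÷ 10^9   # whole billions of entries
terabytes_needed = bytes_needed / 1e12           # decimal terabytes
```

So a dense 3D array would need several terabytes of memory, far more than a typical computer has.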


Although the frequency array would be very large, most entries are zero. For example, "Emma" is a common word, but "Emma Emma Emma" does not occur in the novel. This sparsity of non-zero entries can be used to store the same information in a more efficient structure.

Julia's built-in SparseArrays might sound like a logical choice, but these arrays only support 1D and 2D types, and we also want to directly index using strings, not just integers. So instead, we will use Dict: the dictionary type.


Dict{String, Vector{String}}Dict

"vegetables"

String

"🌽"

"🎃"

"🍕"

"fruits"

String

"🍎"

"🍊"

healthy = Dict("fruits" => ["🍎", "🍊"], "vegetables" => ["🌽", "🎃", "🍕"])


String

"🍎"

"🍊"

healthy["fruits"]


(Did you notice something funny? The dictionary is unordered, which is why the entries were printed in reverse from the definition.)

You can dynamically add or change values of a Dict by assigning to my_dict[key]. You can check whether a key already exists using haskey(my_dict, key).
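As a quick illustration of both techniques, here is a sketch using a made-up dictionary called `stock`, which is not part of the notebook:

```julia
stock = Dict("apples" => 3)               # hypothetical example dictionary

stock["pears"] = 5                        # assigning to a new key adds an entry
stock["apples"] = stock["apples"] + 1     # assigning to an existing key changes its value

has_pears = haskey(stock, "pears")        # the key exists
has_plums = haskey(stock, "plums")        # the key does not exist
```

Reading a key that does not exist, such as stock["plums"], throws a KeyError, which is why checking with haskey first is useful.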

👉 Use these two techniques to write a function word_counts that takes an array of words, and returns a Dict with entries word => number_of_occurances.

For example:

word_counts(["to", "be", "or", "not", "to", "be"])

should return

Dict(
	"to" => 2, 
	"be" => 2, 
	"or" => 1, 
	"not" => 1
)


word_counts (generic function with 1 method)

function word_counts(words::Vector)
	counts = Dict()
	for word in words
		counts[word] = get(counts, word, 0) + 1
	end
	# your code here
	return counts
end


Dict{Any, Any}
	"or" => 1
	"not" => 1
	"to" => 2
	"be" => 2

word_counts(["to", "be", "or", "not", "to", "be"])


Got it!

Good job!


How many times does "Emma" occur in the book?


missing

emma_count = missing
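One possible approach is to look the word up in the dictionary that word_counts returns. The sketch below is self-contained and runs on a toy word list; `count_words`, `toy_words` and `toy_emma_count` are hypothetical names, and in the notebook you would pass the word list of the novel instead.

```julia
# Stand-in for the notebook's word_counts, repeated here so the block runs on its own
function count_words(words)
    counts = Dict{String,Int}()
    for word in words
        counts[word] = get(counts, word, 0) + 1
    end
    return counts
end

toy_words = ["Emma", "Woodhouse", ",", "handsome", ",", "clever", ",", "and", "rich"]

# get returns the fallback 0 if the word never occurs
toy_emma_count = get(count_words(toy_words), "Emma", 0)
```

Using get with a default avoids a KeyError for words that do not appear in the text.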


Great! Let's get back to our ngrams. For the purpose of generating text, we are going to store a continuations cache. This is a dictionary where the keys are $(n-1)$-grams, and the values are all words found in the text that complete the key to an $n$-gram. Let's look at an example:

let
	trigrams = ngrams(split("to be or not to be that is the question", " "), 3)
	cache = continutations_cache(trigrams)
	cache == Dict(
		["to", "be"] => ["or", "that"],
		["be", "or"] => ["not"],
		["or", "not"] => ["to"],
		...
	)
end

So for trigrams, each key consists of the first $2$ words of a trigram, and its value is an array containing the third word of every trigram that starts with those two words.

If the same ngram occurs multiple times (e.g. "said Emma laughing"), then the last word ("laughing") should also be stored multiple times. This will allow us to generate trigrams with the same frequencies as in the original text.
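To see why storing duplicates preserves frequencies, consider the sketch below. The continuation list is made up for illustration: it assumes that "laughing" followed the key twice and "smiling" once.

```julia
# Hypothetical continuations for the key ["said", "Emma"]:
# "laughing" followed it twice in this made-up text, "smiling" once
continuations = ["laughing", "smiling", "laughing"]

# rand picks uniformly from the array, so a word stored twice
# is twice as likely to be chosen as a word stored once
samples = [rand(continuations) for _ in 1:10_000]

share_laughing = count(==("laughing"), samples) / length(samples)
```

With enough samples, `share_laughing` should come out close to 2/3, matching how often "laughing" followed the key in the toy data.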

👉 Write the function continutations_cache (spelled this way throughout the notebook's code), which takes an array of ngrams (i.e. an array of arrays of words, like the result of your ngrams function), and returns a dictionary as described above.


continutations_cache (generic function with 1 method)

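function continutations_cache(grams)
	cache = Dict()
	for gram in grams
		# the first n-1 words form the key
		start = gram[1:end-1]
		# fetch the continuations found so far, or start a new list
		old_list = get(cache, start, [])
		# store the last word, keeping duplicates
		push!(old_list, gram[end])
		cache[start] = old_list
	end
	cache
end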


Dict{Any, Any} (keys are Vector{SubString{String}}, values are Vector{Any})
	["or", "not"] => ["to"]
	["to", "be"] => ["or", "that"]
	["be", "or"] => ["not"]
	["not", "to"] => ["be"]
	["be", "that"] => ["is"]
	["is", "the"] => ["question"]
	["that", "is"] => ["the"]

let

trigrams = ngrams(split("to be or not to be that is the question", " "), 3)

continutations_cache(trigrams)

end


Dict{Any, Any} (keys are Vector{SubString{String}}, values are Vector{Any}; entries as displayed)
	["cognates", "in"] => ["romance"]
	["carolingian", "scribes"] => ["first"]
	["latin", "silva"] => ["which"]
	["portuguese", "selva"] => ["the"]
	["floresta", "etc"] => ["are"]
	["the", "presence"] => ["of"]
	["expanse", "covered"] => ["by"]
	["denote", "the"] => ["royal"]
	["grow", "trees"] => ["in"]
	["italian", "spanish"] => ["and"]
	["is", "commonly"] => ["used"]
	["many", "definitions"] => ["an"]
	["completely", "lacking"] => ["trees"]
	["royal", "hunting"] => ["grounds"]
	["old", "high"] => ["german"]
	["definition", "with"] => ["more"]
	["used", "there"] => ["is"]
	["the", "medieval"] => ["latin"]
	["fores", "denoting"] => ["forest"]
	["forest", "and"] => ["woodland"]
	["into", "english"] => ["as"]
	["as", "the"] => ["word"]
	["past", "will"] => ["grow"]
	["forest", "was"] => ["first"]
	["more", "than"] => ["definitions"]
	["or", "old"] => ["high"]
	["introduced", "into"] => ["english"]
	["wood", "carolingian"] => ["scribes"]
	["foresta", "spanish"] => ["and"]
	["of", "trees"] => ["under"]

continutations_cache(ngrams_circular(forest_words, 3))


Exercise 2.4 - write a novel

We have everything we need to generate our own novel! The final step is to sample random ngrams in such a way that each new ngram overlaps with the previous one. We've done this in the function generate_from_ngrams below. Feel free to look through the code, or to implement your own version.


generate_from_ngrams

generate_from_ngrams(grams, num_words)

Given an array of ngrams (i.e. an array of arrays of words), generate a sequence of num_words words by sampling random ngrams.

"""
	generate_from_ngrams(grams, num_words)

Given an array of ngrams (i.e. an array of arrays of words), generate a sequence of `num_words` words by sampling random ngrams.
"""
function generate_from_ngrams(grams, num_words)
	n = length(first(grams))
	cache = continutations_cache(grams)

	# we need to start the sequence with at least n-1 words.
	# a simple way to do so is to pick a random ngram!
	sequence = [rand(grams)...]

	# we iteratively add one more word at a time
	for i ∈ n+1:num_words
		# the previous n-1 words
		tail = sequence[end-(n-2):end]

		# possible next words
		continuations = cache[tail]

		choice = rand(continuations)
		push!(sequence, choice)
	end
	sequence
end
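The sampling loop can be tried on a short text without the rest of the notebook. The sketch below is a self-contained approximation: `build_cache` is a hypothetical stand-in for continutations_cache, and the circular trigrams are built inline so that every two-word tail has at least one continuation.

```julia
# Build the continuations cache: (n-1)-gram => all words that followed it
function build_cache(grams)
    cache = Dict{Vector{String},Vector{String}}()
    for gram in grams
        # get! inserts an empty list for a new key and returns the stored list
        push!(get!(cache, gram[1:end-1], String[]), gram[end])
    end
    return cache
end

# Circular trigrams: the first n-1 words are appended to the end
words = String.(split("to be or not to be that is the question", " "))
n = 3
extended = vcat(words, words[1:n-1])
grams = [extended[i:i+n-1] for i in 1:length(words)]

cache = build_cache(grams)

# Start from a random trigram, then extend one word at a time
sequence = copy(rand(grams))
while length(sequence) < 20
    tail = sequence[end-(n-2):end]       # the previous n-1 words
    push!(sequence, rand(cache[tail]))   # sample one possible next word
end

generated = join(sequence, " ")
```

Every window of three consecutive words in `sequence` is a trigram from the source text, which is what makes the output resemble the original locally.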


ngrams_circular

Compute the ngrams of an array of words, but add the first n-1 words at the end, to ensure that every ngram ends in the beginning of another ngram.
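The notebook's own ngrams_circular is hidden, so the following is only a sketch of what such a function might look like, based on the description above. The name `ngrams_circular_sketch` is hypothetical.

```julia
# Hypothetical reimplementation; the notebook's own ngrams_circular is hidden
function ngrams_circular_sketch(words, n)
    # Append the first n-1 words so the sequence wraps around
    extended = vcat(words, words[1:n-1])
    # One ngram starts at each original position
    return [extended[i:i+n-1] for i in 1:length(words)]
end

circular = ngrams_circular_sketch(["to", "be", "or", "not"], 3)
```

Because the last ngrams wrap around to the start of the text, every (n-1)-word tail also appears as the beginning of some ngram, so a lookup in the continuations cache cannot fail.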


generate

generate(source_text::AbstractString, num_token; n=3, use_words=true)

Given a source text, generate a String that "looks like" the original text by satisfying the same ngram frequency distribution as the original.


Interactive demo

Enter your own text in the box below, and use that as training data to generate anything!


although the word forest is commonly used there is no universally recognised precise definition with more than  definitions of forest used around the world although a forest is usually defined by the presence of trees under many definitions an area completely lacking trees may still be considered a forest if it grew trees in the past will grow trees in the future or was legally designated as a forest regardless of vegetation typethe word forest derives from the old french forest also fores denoting forest vast expanse covered by trees forest was first introduced into english as the word denoting wild land set aside for hunting without the necessity in definition of having trees on the land possibly a borrowing probably via frankish or old high german of the medieval latin foresta denoting open wood carolingian scribes first used foresta in the capitularies of charlemagne specifically to denote the royal hunting grounds of the king the word was not endemic to romance languages e g native words for forest in the romance languages derived from the latin silva which denoted forest and woodland confer the english sylva and sylvan confer the italian spanish and portuguese selva the romanian silv and the old french selve and cognates in romance languages e g the italian foresta spanish and portuguese floresta etc are all ultimately derivations of the french word


Using grams for characters


tns lr i olvtlret lts lshren nauc l w et dlihovdeauhrfoceanntna ew ne t aatrec afs eg s dgtreeroon sioeg adi attbofateuli tew eesnv gelcc fefr aace eayrstangoaol tsetolrtoe lontxtlntonuos vwf ioi icndoia aua sicr at e ed nrtutiotloed oh kddvepliiloa ehapio a urw danebrnlohsdh a eh iiearwhoetmtop terea eihtr cro in g iof ec stffdtn ehiuniogr rlee stuhii httbafiaersnta fceaey oet e st


Using grams for words


designated the the sylva world ultimately denoting area although world charlemagne the typethe word expanse from expanse medieval g and french languages forest commonly old g the open be the used words romanian word presence from g usually the e spanish spanish scribes derived word of floresta the sylva fores land more word languages past french denote confer of the denoting confer typethe still old spanish defined via silv medieval is first the in completely possibly under introduced although english the the definitions trees of frankish wild definition used presence more still derivations for vast in in usually typethe


Automatic Jane Austen

Uncomment the cell below to generate some Jane Austen text:


, for you than your advice on the father ' s eye . ( I have been so complaisant and obliging to say to a great talker upon little matters , looked with smiling but determined decision , * is not much in the business to remember them ; and it was his jealousy of Frank in almost every line agreeable ; the sixteen miles distant . There , not one of the eye , showed him not to be , for he thoroughly understands the value of a third to cheer a long - standing , and obtained his

generate(emma, 100; n=3) |> Quote


Vector{SubString{String}}
	1 to 20: "Emma", "Woodhouse", ",", "handsome", ",", "clever", ",", "and", "rich", ",", "with", "a", "com", "-", "fortable", "home", "and", "happy", "disposition", ","
	...
	195682 to 195691: "in", "the", "perfect", "happiness", "of", "the", "union", ".", "THE", "END"

austen_words = splitwords(emma)


Vector{Vector{SubString{String}}}
	1: ["Emma", "Woodhouse", ","]
	2: ["Woodhouse", ",", "handsome"]
	3: [",", "handsome", ","]
	4: ["handsome", ",", "clever"]
	5: [",", "clever", ","]
	6: ["clever", ",", "and"]
	7: [",", "and", "rich"]
	8: ["and", "rich", ","]
	9: ["rich", ",", "with"]
	10: [",", "with", "a"]
	11: ["with", "a", "com"]
	12: ["a", "com", "-"]
	13: ["com", "-", "fortable"]
	14: ["-", "fortable", "home"]
	15: ["fortable", "home", "and"]
	16: ["home", "and", "happy"]
	17: ["and", "happy", "disposition"]
	18: ["happy", "disposition", ","]
	19: ["disposition", ",", "seemed"]
	20: [",", "seemed", "to"]
	...
	195682: ["in", "the", "perfect"]
	195683: ["the", "perfect", "happiness"]
	195684: ["perfect", "happiness", "of"]
	195685: ["happiness", "of", "the"]
	195686: ["of", "the", "union"]
	195687: ["the", "union", "."]
	195688: ["union", ".", "THE"]
	195689: [".", "THE", "END"]
	195690: ["THE", "END", "Emma"]
	195691: ["END", "Emma", "Woodhouse"]

austen_grams = ngrams_circular(austen_words, 3)


Dict{Any, Any} (keys are Vector{SubString{String}}, values are Vector{Any}; entries as displayed)
	["be", "out"] => ["of", ",", "of", "of", "of", "of"]
	["great", "pity"] => ["that", "indeed", ";"]
	["assembled", "there"] => [","]
	["she", "asked"] => [".", "me", ","]
	["could", "wish"] => [".'", ",", "to", "for", ".'", ".", "for", ".", "to", ".", "to"]
	["long", "in"] => ["the", "equal", "the", "coming", "want", "reaching"]
	["),", "he"] => ["would"]
	["introducing", "Robert"] => ["Martin"]
	["had", "better"] => ["not", "say", "take", "let", "not", "look", "order", "come", "be", "sense"]
	["our", "visit"] => ["?"]
	["been", "waiting"] => ["the"]
	["thought", "them"] => ["the", "all", "a"]
	["do", "good"] => ["to"]
	["am", "of"] => ["Hartfield", "being"]
	["My", "representa"] => ["-"]
	["doubt", "he"] => ["is", "was", "would", "did"]
	["only", "wait"] => ["till"]
	["bear", "her"] => ["well"]
	["grate", "-"] => ["ful"]
	["environs", "could"] => ["be"]
	["expense", "or"] => ["inconvenience"]
	["Cole", "—"] => ["he"]
	["merriment", "it"] => ["would"]
	["with", "astonishment"] => ["at"]
	["Till", "you"] => ["chose"]
	["s", "thoughts"] => ["all", "from", "a", ","]
	["arrived", "from"] => ["Broadwood", "Mr"]
	["commend", "a"] => ["baked"]
	["be", "spoken"] => ["to"]
	["oppress", "her"] => [".'"]

austen_cache = continutations_cache(austen_grams)


current = let

start_over

[rand(austen_grams)...]

end;


deduplicate_and_sort_by_occurance (generic function with 1 method)

function deduplicate_and_sort_by_occurance(xs)
	counts = collect(word_counts(xs))
	sort(counts, by=last, rev=true) .|> first |> collect
end
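To illustrate what this helper does, here is a self-contained sketch of the same idea. `count_items` and `unique_by_frequency` are hypothetical stand-ins for word_counts and deduplicate_and_sort_by_occurance, written out so the block runs on its own.

```julia
# Stand-in for word_counts: item => number of occurrences
function count_items(xs)
    counts = Dict{eltype(xs),Int}()
    for x in xs
        counts[x] = get(counts, x, 0) + 1
    end
    return counts
end

# Same idea as deduplicate_and_sort_by_occurance
function unique_by_frequency(xs)
    pairs_vec = collect(count_items(xs))         # Vector of item => count pairs
    sorted = sort(pairs_vec, by=last, rev=true)  # most frequent first
    return first.(sorted)                        # keep only the items
end

ranked = unique_by_frequency(["of", ",", "of", "of", ",", "the"])
```

Each distinct item appears once in the result, ordered from most to least frequent. This is a natural fit for presenting the possible continuations of a key as a list of choices.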


Function library

Just some helper functions used in the notebook.


Quote (generic function with 1 method)


show_pair_frequencies (generic function with 1 method)


compimg (generic function with 2 methods)


hint (generic function with 1 method)


almost (generic function with 1 method)


still_missing (generic function with 2 methods)


keep_working (generic function with 2 methods)


Markdown.MD

Fantastic!

Splendid!

Great!

Yay ❤

Great! 🎉

Well done!

Keep it up!

Good job!

Awesome!

You got the right answer!

Let's move on to the next section.


correct (generic function with 2 methods)


not_defined (generic function with 1 method)



The interactive word picker is at the bottom of this notebook!

Homework 3: Structure and Language

Exercise 1: Language detection

Exercise 1.1 - Cleaning data

Exercise 1.2 - Letter frequencies

Exercise 1.3 - Transition frequencies

Exercise 1.4 - Language detection

Mystery sample

Further reading

Exercise 2 - Language generation

Dataset

Exercise 2.1 - bigrams revisited

Exercise 2.2 - frequency matrix revisited

Exercise 2.4 - write a novel

Interactive demo

Automatic Jane Austen

"Emma" – rearranged

Choose the next word!

Function library